Why R?
What is R Markdown?
How should we conduct statistical analyses?
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And  Tables
---- -------
Like This
- “Literate programming”
- Embed R code in a Markdown document
- Renders textual output along with graphics
```{r chunk_name}
library(ggplot2)  # provides qplot()
x <- rnorm(1000)
length(x)
qplot(x, bins = 10,
      fill = I("orange"),
      color = I("black"))
```
## [1] 1000
library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>%
  filter(origin == "PDX") %>%
  na.omit() %>%
  select(-year, -origin)
str(object = pdx_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 52808 obs. of 14 variables:
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 1 8 28 526 541 549 559 602 606 618 ...
## $ dep_delay: num 96 13 -2 -4 1 24 -1 -3 6 -2 ...
## $ arr_time : int 235 548 800 1148 911 907 916 1204 746 1135 ...
## $ arr_delay: num 70 -4 -23 15 4 12 -9 7 3 -30 ...
## $ carrier : chr "AS" "UA" "US" "UA" ...
## $ tailnum : chr "N508AS" "N37422" "N547UW" "N813UA" ...
## $ flight : int 145 1609 466 229 1569 649 796 1573 406 1650 ...
## $ dest : chr "ANC" "IAH" "CLT" "IAH" ...
## $ air_time : num 194 201 251 217 130 122 125 203 87 184 ...
## $ distance : num 1542 1825 2282 1825 991 ...
## $ hour : num 0 0 0 5 5 5 5 6 6 6 ...
## $ minute : num 1 8 28 26 41 49 59 2 6 18 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:527] 145 146 147 148 149 150 189 304 310 313 ...
## .. ..- attr(*, "names")= chr [1:527] "145" "146" "147" "148" ...
We randomly select 3000 flights from this set of 52808 flights.
set.seed(2016)
pdx_rs <- pdx_flights %>% sample_n(3000)
Explanatory variable: categorical
Response variable: continuous
library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")
# library(ggplot2)
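# Hide the outlier points
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA)
# Zoom in on the boxes without dropping any data
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))
# Add each carrier's mean as a red point
# (fun.y matches the ggplot2 2.1.0 used here; ggplot2 3.3.0 and later name this argument fun)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")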
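library(knitr)  # provides kable()
# Per-carrier mean and median departure delay in the sample
pdx_summary <- pdx_rs %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
data("airlines", package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)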
| carrier | Mean Delay | Median Delay | name |
|---|---|---|---|
| AA | 23.2268908 | -2.0 | American Airlines Inc. |
| AS | 0.9781977 | -5.0 | Alaska Airlines Inc. |
| B6 | 5.2826087 | -3.0 | JetBlue Airways |
| DL | -0.0369128 | -3.0 | Delta Air Lines Inc. |
| F9 | 4.3783784 | -3.0 | Frontier Airlines Inc. |
| HA | -4.3333333 | -5.0 | Hawaiian Airlines Inc. |
| OO | 4.1436266 | -4.0 | SkyWest Airlines Inc. |
| UA | 8.6853933 | -2.0 | United Air Lines Inc. |
| US | 4.5902778 | -3.5 | US Airways Inc. |
| VX | 2.7391304 | -4.0 | Virgin America |
| WN | 13.2713816 | 2.0 | Southwest Airlines Co. |
Assuming conditions are met…
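The slides do not show how the conditions are checked. As a rough sketch only, the usual ANOVA conditions (independent observations, roughly normal residuals, similar spread across groups) could be examined along the following lines. The data frame `sim` and its three carriers are simulated stand-ins, not the flights data, and comparing the largest and smallest group standard deviations is just one common heuristic.

```r
set.seed(2016)

# Simulated stand-in data (an assumption for illustration, not pdx_rs)
sim <- data.frame(
  carrier = rep(c("A", "B", "C"), each = 100),
  dep_delay = c(rnorm(100, mean = 0, sd = 10),
                rnorm(100, mean = 5, sd = 10),
                rnorm(100, mean = 2, sd = 10))
)

# Similar spread: compare the standard deviations across groups
group_sd <- tapply(sim$dep_delay, sim$carrier, sd)
sd_ratio <- max(group_sd) / min(group_sd)
group_sd
sd_ratio

# Roughly normal residuals: normal quantile-quantile plot of the residuals
sim_anova <- aov(dep_delay ~ carrier, data = sim)
qqnorm(residuals(sim_anova))
qqline(residuals(sim_anova))
```

A ratio close to 1 and residual points that stay near the reference line are consistent with the conditions; independence has to be argued from how the data were collected.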
pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## carrier 10 103041 10304 8.96 0.0000000000000111 ***
## Residuals 2989 3437331 1150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(p\)-value resulting from our analysis is approximately \(1.11 \times 10^{-14}\), which rounds to 0. This is the probability of obtaining an \(F\) statistic of 8.960177 or greater, assuming that the mean departure delays for all carriers are the same (that is, the null hypothesis is true).
This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one carrier has a mean departure delay that differs from the others.
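To make the link between the \(F\) statistic and the \(p\)-value concrete, here is a small sketch that recomputes both from the rounded sums of squares and degrees of freedom printed in the ANOVA table. The variable names are illustrative, and because the inputs are rounded, the results should agree with the table only approximately.

```r
# Values copied from the printed ANOVA table (rounded)
ss_carrier <- 103041
df_carrier <- 10
ss_resid   <- 3437331
df_resid   <- 2989

# F is the ratio of the two mean squares
f_stat <- (ss_carrier / df_carrier) / (ss_resid / df_resid)

# p-value: upper-tail area of the F distribution beyond f_stat
p_value <- pf(f_stat, df1 = df_carrier, df2 = df_resid, lower.tail = FALSE)

f_stat
p_value
```

Using `lower.tail = FALSE` asks `pf()` for the probability of an \(F\) value at least as large as the one observed, which is the definition used above.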
“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”
- Roger Peng, Johns Hopkins
A fully worked-out example of this analysis is available here as an HTML file to view. The corresponding R Markdown file is available here.
We collected a random sample of the actual data on all 2014 flights departing PDX. Does a difference actually exist in the average departure delays for carriers in our population (all 2014 flights departing PDX)?
# Sample: per-carrier mean and median departure delay
pdx_summary <- pdx_rs %>%
  group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_summary)
# Population: the same summary over all PDX flights
pdx_full_summary <- pdx_flights %>%
  group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_full_summary)
| carrier | Mean Delay | Median Delay |
|---|---|---|
| AA | 13.0708625 | -2 |
| AS | 0.9418523 | -5 |
| B6 | 5.9677926 | -3 |
| DL | 2.5678412 | -3 |
| F9 | 8.4546125 | -3 |
| HA | -0.8027397 | -5 |
| OO | 4.2595904 | -4 |
| UA | 7.3794427 | -2 |
| US | 1.5259545 | -3 |
| VX | 6.2477477 | -4 |
| WN | 12.1458352 | 1 |
Ratings from all beers I’ve rated using Untappd since February 2015
Use the group_by and summarize functions in the dplyr package along with appropriate plots using the ggplot2 package to understand which styles of beers I like best.
We are just doing data visualization and summary here (not inference)
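One possible starting point for this exercise is sketched below. The data frame `beers` and its columns `style` and `rating` are hypothetical placeholders, since the layout of the actual ratings file is not shown here; the real column names may differ.

```r
library(dplyr)
library(ggplot2)

# Hypothetical ratings (made-up values; column names are assumptions)
beers <- data.frame(
  style  = c("IPA", "IPA", "Stout", "Stout", "Pilsner", "Pilsner"),
  rating = c(4.0, 4.5, 3.5, 4.0, 3.0, 3.25)
)

# Mean rating and number of beers rated for each style, best first
style_summary <- beers %>%
  group_by(style) %>%
  summarize(mean_rating = mean(rating), n_rated = n()) %>%
  arrange(desc(mean_rating))

style_summary

# Side-by-side boxplots of ratings by style
ggplot(beers, aes(x = style, y = rating)) +
  geom_boxplot()
```

Keeping the count of beers next to each mean helps avoid over-reading a style that was rated only once or twice.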
To access the template file to begin your analysis, go to
- Code for slide creation on my GitHub page
- Slides available here
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_0.9.6 knitr_1.13 ggplot2_2.1.0 dplyr_0.4.3.9001
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 magrittr_1.5 munsell_0.4.3 colorspace_1.2-6 R6_2.1.2
## [6] highr_0.6 stringr_1.0.0 plyr_1.8.3 tools_3.3.0 grid_3.3.0
## [11] gtable_0.2.0 DBI_0.4-1 htmltools_0.3.5 lazyeval_0.1.10 yaml_2.1.13
## [16] assertthat_0.1 digest_0.6.9 tibble_1.0-3 formatR_1.4 evaluate_0.9
## [21] labeling_0.3 stringi_1.0-1 scales_0.4.0